AITopics | asr text

Collaborating Authors

asr text

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Context-Enhanced Granular Edit Representation for Efficient and Accurate ASR Post-editing

Vejsiu, Luan, Zheng, Qianyu, Chen, Haoxuan, Han, Yizhou

arXiv.org Artificial IntelligenceSep-19-2025

Despite ASR technology being full-scale adopted by industry and for large portions of the population, ASR systems often have errors that require editors to post-edit text quality. While LLMs are powerful post-editing tools, baseline full rewrite models have inference inefficiencies because they often generate the same redundant text over and over again. Compact edit representations have existed but often lack the efficacy and context required for optimal accuracy. This paper introduces CEGER (Context-Enhanced Granular Edit Representation), a compact edit representation that was generated for highly accurate, efficient ASR post-editing. CEGER allows LLMs to generate a sequence of structured, fine-grained, contextually rich commands to modify the original ASR output. A separate expansion module deterministically reconstructs the corrected text based on the commands. Extensive experiments on the LibriSpeech dataset that were conducted, CEGER achieves state-of-the-art accuracy, achieving the lowest word error rate (WER) versus full rewrite and prior compact representations.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2509.14263

Country:

Asia (0.68)
Europe > Austria (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.79)

Add feedback

Adversarial Attacks to Multi-Modal Models

Dou, Zhihao, Hu, Xin, Yang, Haibo, Liu, Zhuqing, Fang, Minghong

arXiv.org Artificial IntelligenceSep-10-2024

Multi-modal models have gained significant attention due to their powerful capabilities. These models effectively align embeddings across diverse data modalities, showcasing superior performance in downstream tasks compared to their unimodal counterparts. Recent study showed that the attacker can manipulate an image or audio file by altering it in such a way that its embedding matches that of an attacker-chosen targeted input, thereby deceiving downstream models. However, this method often underperforms due to inherent disparities in data from different modalities. In this paper, we introduce CrossFire, an innovative approach to attack multi-modal models. CrossFire begins by transforming the targeted input chosen by the attacker into a format that matches the modality of the original image or audio file. We then formulate our attack as an optimization problem, aiming to minimize the angular deviation between the embeddings of the transformed input and the modified image or audio file. Solving this problem determines the perturbations to be added to the original media. Our extensive experiments on six real-world benchmark datasets reveal that CrossFire can significantly manipulate downstream tasks, surpassing existing attacks. Additionally, we evaluate six defensive strategies against CrossFire, finding that current defenses are insufficient to counteract our CrossFire.

crossfire, dataset, image feature, (16 more...)

arXiv.org Artificial Intelligence

2409.06793

Country:

North America > United States > Utah > Salt Lake County > Salt Lake City (0.05)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
North America > United States > Texas > Denton County > Denton (0.04)
(5 more...)

Genre: Research Report > New Finding (0.66)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (0.84)
Leisure & Entertainment > Sports > Tennis (0.30)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)
(2 more...)

Add feedback

ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval

Zhao, Ruixiang, Jia, Jian, Li, Yan, Bai, Xuehan, Chen, Quan, Li, Han, Jiang, Peng, Li, Xirong

arXiv.org Artificial IntelligenceAug-6-2024

E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live stream promotions. A unified and vectorized cross-domain production representation is essential. Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate. While Automatic Speech Recognition (ASR) text derived from the short or live-stream videos is readily accessible, how to de-noise the excessively noisy text for multimodal representation learning is mostly untouched. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). In order to extract product-specific information from the raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.

asr text, representation, retrieval, (14 more...)

arXiv.org Artificial Intelligence

2408.02978

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.64)

Industry: Information Technology (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)

Add feedback

Boosting Punctuation Restoration with Data Generation and Reinforcement Learning

Lai, Viet Dac, Salinas, Abel, Tan, Hao, Bui, Trung, Tran, Quan, Yoon, Seunghyun, Deilamsalehy, Hanieh, Dernoncourt, Franck, Nguyen, Thien Huu

arXiv.org Artificial IntelligenceJul-24-2023

Punctuation restoration is an important task in automatic speech recognition (ASR) which aim to restore the syntactic structure of generated ASR texts to improve readability. While punctuated texts are abundant from written documents, the discrepancy between written punctuated texts and ASR texts limits the usability of written texts in training punctuation restoration systems for ASR texts. This paper proposes a reinforcement learning method to exploit in-topic written texts and recent advances in large pre-trained generative language models to bridge this gap. The experiments show that our method achieves state-of-the-art performance on the ASR test set on two benchmark datasets for punctuation restoration.

machine learning, pr model, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

2307.12949

Country:

North America > United States > California (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
North America > United States > Oregon (0.04)

Genre: Research Report (0.50)

Industry: Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.86)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)

Add feedback